Description

This invention relates to methods of processing an audio signal. 

Consumer, industrial, studio and laboratory products for storing, processing and communicating high quality audio
signals are in great demand. For example, so-called compact disc ("CD") and digital audio tape ("DAT") recordings for
music have largely replaced the long-popular phonograph record and cassette tape. Likewise, recently available digital
audio tape ("DAT") recordings promise to provide greater flexibility and high storage density for high quality audio
signals. See, also Tan and Vermeulen, "Digital audiotape for data storage", IEEE Spectrum, pp. 34-38 (Oct. 1989). A
demand is also arising for broadcast applications of digital technology that offer CD-like quality.

While these emerging digital techniques are capable of producing high quality signals, such performance is often
achieved only at the expense of considerable data storage capacity or transmission bandwidth. Accordingly, much
work has been done in an attempt to compress high quality audio signals for storage and transmission.

Most of the prior work directed to compressing signals for transmission and storage has sought to reduce the 
redundancies that the source of the signals places on the signal. Thus, such techniques as ADPCM, sub-band coding 
and transform coding described, e.g., in N.S. Jayant and P. Noll, "Digital Coding of Waveforms," Prentice-Hall, Inc.
1984, have sought to eliminate redundancies that otherwise would exist in the source signals. 

In other approaches, the irrelevant information in source signals is sought to be eliminated using techniques based 
on models of the human perceptual system. Such techniques are described, e.g., in E.F. Schroeder and J.J. Platte,
"MSC: Stereo Audio Coding with CD-Quality and 256 kBIT/SEC," IEEE Trans. on Consumer Electronics, Vol. CE-33,
No. 4, November 1987; and Johnston, Transform Coding of Audio Signals Using Noise Criteria, Vol. 6, No. 2, IEEE J.
S.C.A. (Feb. 1988). 

Perceptual coding, as described, e.g., in the Johnston paper relates to a technique for lowering required bitrates 
(or reapportioning available bits) or total number of bits in representing audio signals. In this form of coding, a masking 
threshold for unwanted signals is identified as a function of frequency of the desired signal. Then, inter alia, the coarseness
of quantizing used to represent a signal component of the desired signal is selected such that the quantizing
noise introduced by the coding does not rise above the noise threshold, though it may be quite near this threshold.
The introduced noise is therefore masked in the perception process. While traditional signal-to-noise ratios for such
perceptually coded signals may be relatively low, the quality of these signals upon decoding, as perceived by a human
listener, is nevertheless high. 

Brandenburg et al, U.S. Patent 5,040,217, issued August 13, 1991, describes a system for efficiently coding and
decoding high quality audio signals using such perceptual considerations. In particular, using a measure of the "noise-like"
or "tone-like" quality of the input signals, the embodiments described in the latter system provide a very efficient
coding for monophonic audio signals.

It is, of course, important that the coding techniques used to compress audio signals do not themselves introduce
offensive components or artifacts. This is especially important when coding stereophonic audio information where
coded information corresponding to one stereo channel, when decoded for reproduction, can interfere or interact with
coding information corresponding to the other stereo channel. Implementation choices for coding two stereo channels
include so-called "dual mono" coders using two independent coders operating at fixed bit rates. By contrast, "joint
mono" coders use two monophonic coders but share one combined bit rate, i.e., the bit rate for the two coders is
constrained to be less than or equal to a fixed rate, but trade-offs can be made between the bit rates for individual
coders. "Joint stereo" coders are those that attempt to use interchannel properties for the stereo pair for realizing
additional coding gain.

It has been found that the independent coding of the two channels of a stereo pair, especially at low bit-rates, can
lead to a number of undesirable psychoacoustic artifacts. Among them are those related to the localization of coding
noise that does not match the localization of the dynamically imaged signal. Thus the human stereophonic perception
process appears to add constraints to the encoding process if such mismatched localization is to be avoided. This
finding is consistent with reports on binaural masking-level differences that appear to exist, at least for low frequencies,
such that noise may be isolated spatially. Such binaural masking-level differences are considered to unmask a noise
component that would be masked in a monophonic system. See, for example, B.C.J. Moore, "An Introduction to the
Psychology of Hearing, Second Edition," especially chapter 5, Academic Press, Orlando, FL, 1982.

One technique for reducing psychoacoustic artifacts in the stereophonic context employs the ISO-WG11-MPEG-Audio
Psychoacoustic II [ISO] Model. In this model, a second limit of signal-to-noise ratio ("SNR") is applied to
signal-to-noise ratios inside the psychoacoustic model. However, such additional SNR constraints typically require the
expenditure of additional channel capacity or (in storage applications) the use of additional storage capacity, at low
frequencies, while also degrading the monophonic performance of the coding.

According to this invention there is provided a method as claimed in claim 1.

Interchannel redundancy and irrelevancy are exploited to achieve lower bit-rates while maintaining high quality 
reproduction after decoding. While particularly appropriate to stereophonic coding and decoding, the advantages of 
the present invention may also be realized in conventional dual monophonic stereo coders.

An illustrative embodiment employs a filter bank architecture using a Modified Discrete Cosine Transform (MDCT).
In order to code the full range of signals that may be presented to the system, the illustrative embodiment advantageously
uses both L/R (Left and Right) and M/S (Sum/Difference) coding, switched in both frequency and time in a
signal dependent fashion. A new stereophonic noise masking model advantageously detects and avoids binaural artifacts
in the coded stereophonic signal. Interchannel redundancy is exploited to provide enhanced compression
without degrading audio quality.

The time behavior of both Right and Left audio channels is advantageously accurately monitored and the results
used to control the temporal resolution of the coding process. Thus, in one aspect, an illustrative embodiment provides
processing of input signals in terms of either a normal MDCT window, or, when signal conditions indicate, shorter
windows. Further, dynamic switching between RIGHT/LEFT or SUM/DIFFERENCE coding modes is provided both in
time and frequency to control unwanted binaural noise localization, to prevent the need for overcoding of SUM/DIFFERENCE
signals, and to maximize the global coding gain.

A typical bitstream definition and rate control loop are described which provide useful flexibility in forming the coder
output. Interchannel irrelevancies are advantageously eliminated and stereophonic noise masking improved, thereby
achieving improved reproduced audio quality in jointly coded stereophonic pairs. The rate control method used in an
illustrative embodiment uses an interpolation between absolute thresholds and masking threshold for signals below
the rate-limit of the coder, and a threshold elevation strategy under rate-limited conditions.

It proves advantageous to employ an improved Huffman-like entropy coder/decoder to further reduce the channel
bit rate requirements, or storage capacity for storage applications. The noiseless compression method illustratively
used employs Huffman coding along with a frequency-partitioning scheme to efficiently code the frequency samples
for L, R, M and S, as may be dictated by the perceptual threshold.

The present invention provides a mechanism for determining the scale factors to be used in quantizing the audio
signal (i.e., the MDCT coefficients output from the analysis filter bank) by using an approach different from the prior
art, and while avoiding many of the restrictions and costs of prior quantizer/rate-loops. The audio signals quantized
pursuant to the present invention introduce less noise and encode into fewer bits than the prior art.

These results are obtained in an illustrative embodiment in which the utilized scale factor is iteratively derived by
interpolating between a scale factor derived from a calculated threshold of hearing at the frequency corresponding to
the frequency of the respective spectral coefficient to be quantized and a scale factor derived from the absolute threshold
of hearing at said frequency until the quantized spectral coefficients can be encoded within permissible limits.

Brief Description of the Drawings 

FIG. 1 presents an illustrative prior art audio communication/storage system of a type in which aspects of the
present invention find application, and provides improvement and extension;
FIG. 2 presents an illustrative perceptual audio coder (PAC) in which the advances and teachings of the present
invention find application, and provide improvement and extension;
FIG. 3 shows a representation of a useful masking level difference factor used in threshold calculations;
FIG. 4 presents an illustrative analysis filter bank;
FIG. 5(a) through 5(e) illustrate the operation of various window functions;
FIG. 6 is a flow chart illustrating window switching functionality;
FIG. 7 is a block/flow diagram illustrating the overall processing of input signals to derive the output bitstream;
FIG. 8 illustrates certain threshold variations;
FIG. 9 is a flowchart representation of certain bit allocation functionality;
FIG. 10 shows bitstream organization;
FIGs 11a through 11c illustrate certain Huffman coding operations;
FIG. 12 shows operations at a decoder that are complementary to those for an encoder;
FIG. 13 is a flowchart illustrating certain quantization operations; and
FIG. 14(a) through 14(g) are illustrative windows for use with the filter bank of FIG. 4.

Detailed Description 
1. Overview

To simplify the present disclosure, the following patents, patent applications and publications are referred to:

U.S. Patent 5,040,217, issued August 13, 1991 by K. Brandenburg et al, United States Patent No. 5341457; J.D.
Johnston, Transform Coding of Audio Signals Using Perceptual Noise Criteria, IEEE Journal on Selected Areas in
Communications, Vol. 6, No. 2 (Feb. 1988); International Patent Application (PCT) WO 88/01811, filed March 10, 1988;
United States Patent Application Serial No. 07/491,373, entitled Hybrid Perceptual Coding, filed March 9, 1990 (corresponding
EP-A-0 446 037); Brandenburg et al, Aspec: Adaptive Spectral Entropy Coding of High Quality Music
Signals, AES 90th Convention (1991); Johnston, J., Estimation of Perceptual Entropy Using Noise Masking Criteria,
ICASSP, (1988); J. D. Johnston, Perceptual Transform Coding of Wideband Stereo Signals, ICASSP (1989); E.F.
Schroeder and J.J. Platte, "MSC: Stereo Audio Coding with CD-Quality and 256 kBIT/SEC," IEEE Trans. on Consumer
Electronics, Vol. CE-33, No. 4, November 1987; and Johnston, Transform Coding of Audio Signals Using Noise Criteria,
Vol. 6, No. 2, IEEE J.S.C.A. (Feb. 1988).

For clarity of explanation, the illustrative embodiment is presented as comprising individual functional blocks (including
functional blocks labeled as "processors"). The functions these blocks represent may be provided through the
use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software. (Use
of the term "processor" should not be construed to refer exclusively to hardware capable of executing software.) Illustrative
embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16 or DSP32C,
and software performing the operations discussed below. Very large scale integration (VLSI) hardware embodiments,
as well as hybrid DSP/VLSI embodiments, may also be provided.

FIG. 1 is an overall block diagram of a system useful for incorporating an illustrative embodiment. At the level
shown, the system of FIG. 1 illustrates systems known in the prior art, but modifications and extensions described
herein will make clear the contributions of the present invention. In FIG. 1, an analog audio signal 101 is fed into a
preprocessor 102 where it is sampled (typically at 48 kHz) and converted into a digital pulse code modulation ("PCM")
signal 103 (typically 16 bits) in standard fashion. The PCM signal 103 is fed into a perceptual audio coder 104 ("PAC")
which compresses the PCM signal and outputs the compressed PAC signal to a communications channel/storage
medium 105. From the communications channel/storage medium the compressed PAC signal is fed into a perceptual
audio decoder 107 which decompresses the compressed PAC signal and outputs a PCM signal 108 which is representative
of the compressed PAC signal. From the perceptual audio decoder, the PCM signal 108 is fed into a post-processor
109 which creates an analog representation of the PCM signal 108.

An illustrative embodiment of the perceptual audio coder 104 is shown in block diagram form in FIG. 2. As in the
case of the system illustrated in FIG. 1, the system of FIG. 2, without more, may equally describe certain prior art
systems, e.g., the system disclosed in the Brandenburg, et al U.S. Patent 5,040,217. However, with the extensions
and modifications described herein, important new results are obtained. The perceptual audio coder of FIG. 2 may
advantageously be viewed as comprising an analysis filter bank 202, a perceptual model processor 204, a
quantizer/rate-loop processor 206 and an entropy coder 208.

The filter bank 202 in FIG. 2 advantageously transforms an input audio signal in time/frequency in such manner
as to provide both some measure of signal processing gain (i.e. redundancy extraction) and a mapping of the filter
bank inputs in a way that is meaningful in light of the human perceptual system. Advantageously, the well-known
Modified Discrete Cosine Transform (MDCT) described, e.g., in J. P. Princen and A.B. Bradley, "Analysis/Synthesis
Filter Bank Design Based on Time Domain Aliasing Cancellation," IEEE Trans. ASSP, Vol. 34, No. 5, October, 1986,
may be adapted to perform such transforming of the input signals.

Features of the MDCT that make it useful in the present context include its critical sampling characteristic, i.e. for
every n samples into the filter bank, n samples are obtained from the filter bank. Additionally, the MDCT typically
provides half-overlap, i.e. the transform length is exactly twice the number of samples, n, shifted into the
filterbank. The half-overlap provides a good method of dealing with the control of noise injected independently into
each filter tap as well as providing a good analysis window frequency response. In addition, in the absence of quantization,
the MDCT provides exact reconstruction of the input samples, subject only to a delay of an integral number of
samples.
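The critical sampling, half-overlap and exact reconstruction properties can be checked numerically. The following is a minimal sketch, assuming the textbook MDCT/IMDCT pair with a sine window (which satisfies the Princen-Bradley condition); it is not the filter bank of the illustrative embodiment, and the function names are hypothetical. Samples are kept in ordinary time order here, unlike the buffer ordering described later for input signal buffer 302.

```python
import math

def sine_window(N):
    # 2N-point sine window; w[n]^2 + w[n+N]^2 = 1 (Princen-Bradley condition).
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct(block, window):
    # 2N windowed input samples -> N coefficients (critical sampling).
    N = len(block) // 2
    n0 = 0.5 + N / 2
    return [sum(window[n] * block[n] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs, window):
    # N coefficients -> 2N windowed, time-aliased output samples.
    N = len(coeffs)
    n0 = 0.5 + N / 2
    return [window[n] * (2.0 / N) *
            sum(coeffs[k] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
                for k in range(N))
            for n in range(2 * N)]

def analyze_synthesize(signal, N):
    # Blocks of 2N samples advance by N samples (half-overlap); the
    # overlap-add of consecutive inverse transforms cancels the aliasing.
    window = sine_window(N)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * N + 1, N):
        coeffs = mdct(signal[start:start + 2 * N], window)
        for n, value in enumerate(imdct(coeffs, window)):
            out[start + n] += value
    return out
```

Only samples covered by two overlapping blocks are reconstructed exactly; the first and last N samples of a finite signal see a single block and remain aliased.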

One aspect in which the MDCT is advantageously modified for use in connection with a highly efficient stereophonic
audio coder is the provision of the ability to switch the length of the analysis window for signal sections which have
strongly non-stationary components in such a fashion that it retains the critically sampled and exact reconstruction
properties. U.S. patent application 07/844819, abandoned, parent of U.S. application 08/334889 (now US-A-5 592
584) (referred to hereinafter as the "filter bank application"), filed of even date with this application, describes a filter
bank appropriate for performing the functions of element 202 in FIG. 2.

The perceptual model processor 204 shown in FIG. 2 calculates an estimate of the perceptual importance, noise
masking properties, or just noticeable noise floor of the various signal components in the analysis bank. Signals representative
of these quantities are then provided to other system elements to provide improved control of the filtering
operations and organizing of the data to be sent to the channel or storage medium. Rather than using the critical band
by critical band analysis described in J.D. Johnston, "Transform Coding of Audio Signals Using Perceptual Noise
Criteria," IEEE J. on Selected Areas in Communications, Feb. 1988, an illustrative embodiment advantageously uses
finer frequency resolution in the calculation of thresholds. Thus instead of using an overall tonality metric as in the
last-cited Johnston paper, a tonality method based on that mentioned in K. Brandenburg and J.D. Johnston, "Second
Generation Perceptual Audio Coding: The Hybrid Coder," AES 89th Convention, 1990 provides a tonality estimate that
varies over frequency, thus providing a better fit for complex signals.

The psychoacoustic analysis performed in the perceptual model processor 204 provides a noise threshold for the
L (Left), R (Right), M (Sum) and S (Difference) channels, as may be appropriate, for both the normal MDCT window
and the shorter windows. Use of the shorter windows is advantageously controlled entirely by the psychoacoustic
model processor.

In operation, an illustrative embodiment of the perceptual model processor 204 evaluates thresholds for the left
and right channels, denoted THR_l and THR_r. The two thresholds are then compared in each of the illustrative 35 coder
frequency partitions (56 partitions in the case of an active window-switched block). In each partition where the two
thresholds vary between left and right by less than some amount, typically 2 dB, the coder is switched into M/S mode.
That is, the left signal for that band of frequencies is replaced by M = (L+R)/2, and the right signal is replaced by S =
(L-R)/2. The actual amount of difference that triggers the last-mentioned substitution will vary with bitrate constraints
and other system parameters.
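The per-partition switching rule can be sketched as follows. This is an illustration only: the thresholds are assumed to be given in dB, one value per coder partition, the 2 dB figure is the "typical" value quoted above, and the function names are hypothetical.

```python
def choose_ms_partitions(thr_left_db, thr_right_db, max_diff_db=2.0):
    # True for each partition whose left and right thresholds differ by
    # less than max_diff_db, i.e. where the coder would switch to M/S mode.
    return [abs(l - r) < max_diff_db
            for l, r in zip(thr_left_db, thr_right_db)]

def to_mid_side(left, right):
    # M = (L+R)/2 replaces the left signal, S = (L-R)/2 replaces the right.
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def from_mid_side(mid, side):
    # Inverse mapping for a decoder: L = M + S, R = M - S.
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

Because the mapping is a simple sum and difference, it is exactly invertible in the absence of quantization.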

The same threshold calculation used for L and R thresholds is also used for M and S thresholds, with the threshold
calculated on the actual M and S signals. First, the basic thresholds, denoted BTHR_m and BTHR_s, are calculated. Then,
the following steps are used to calculate the stereo masking contribution of the M and S signals.

1. An additional factor is calculated for each of the M and S thresholds. This factor, called MLD_m and MLD_s, is
calculated by multiplying the spread signal energy (as derived, e.g., in J.D. Johnston, "Transform Coding of Audio
Signals Using Perceptual Noise Criteria," IEEE J. on Selected Areas in Communications, Feb. 1988; K. Brandenburg
and J.D. Johnston, "Second Generation Perceptual Audio Coding: The Hybrid Coder," AES 89th Convention,
1990; and Brandenburg, et al U.S. Patent 5,040,217) by a masking level difference factor shown illustratively in
FIG. 3. This calculates a second level of detectability of noise across frequency in the M and S channels, based
on the masking level differences shown in various sources.

2. The actual threshold for M (THR_m) is calculated as THR_m = max(BTHR_m, min(BTHR_s, MLD_s)) and the threshold
for S is calculated as THR_s = max(BTHR_s, min(BTHR_m, MLD_m)).

In effect, the MLD signal substitutes for the BTHR signal in cases where there is a chance of stereo unmasking.
It is not necessary to consider the issue of M and S threshold depression due to unequal L and R thresholds, because
of the fact that L and R thresholds are known to be equal.
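Step 2 above, read as THR_m = max(BTHR_m, min(BTHR_s, MLD_s)) and THR_s = max(BTHR_s, min(BTHR_m, MLD_m)), can be sketched per partition as follows. This is a hedged illustration: the masking level difference factors of FIG. 3 are not reproduced in the text, so the MLD values are taken as inputs, and the function name is hypothetical.

```python
def stereo_thresholds(bthr_m, bthr_s, mld_m, mld_s):
    # Each argument is a list with one value per coder partition.
    # THR_m = max(BTHR_m, min(BTHR_s, MLD_s))
    thr_m = [max(bm, min(bs, ms))
             for bm, bs, ms in zip(bthr_m, bthr_s, mld_s)]
    # THR_s = max(BTHR_s, min(BTHR_m, MLD_m))
    thr_s = [max(bs, min(bm, mm))
             for bs, bm, mm in zip(bthr_s, bthr_m, mld_m)]
    return thr_m, thr_s
```

For example, with BTHR_m = 1, BTHR_s = 5, MLD_m = 0.5 and MLD_s = 3 in some partition, the M threshold becomes max(1, min(5, 3)) = 3 while the S threshold stays at max(5, min(1, 0.5)) = 5.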

The quantizer and rate control processor 206 used in the illustrative coder of FIG. 2 takes the outputs from the
analysis bank and the perceptual model, allocates bits and noise, and controls other system parameters so as to meet
the required bit rate for the given application. In some example coders this may consist of nothing more than quantization
so that the just noticeable difference of the perceptual model is never exceeded, with no (explicit) attention to bit rate;
in some coders this may be a complex set of iteration loops that adjusts distortion and bitrate in order to achieve a
balance between bit rate and coding noise. Also desirably performed by the rate loop processor 206, and described
in the rate loop application, is the function of receiving information from the quantized analyzed signal and any requisite
side information, inserting synchronization and framing information. Again, these same functions are broadly described
in the incorporated Brandenburg, et al, U.S. patent 5,040,217.

Entropy coder 208 is used to achieve a further noiseless compression in cooperation with the rate control processor
206. In particular, entropy coder 208 advantageously receives inputs including a quantized audio signal output from
quantizer/rate-loop 206, performs a lossless encoding on the quantized audio signal, and outputs a compressed audio
signal to the communications channel/storage medium 106.

Illustrative entropy coder 208 advantageously comprises a novel variation of the minimum-redundancy Huffman
coding technique to encode each quantized audio signal. The Huffman codes are described, e.g., in D.A. Huffman, "A
Method for the Construction of Minimum Redundancy Codes", Proc. IRE, 40:1098-1101 (1952) and T.M. Cover and
J.A. Thomas, Elements of Information Theory, pp. 92-101 (1991). The useful adaptations of the Huffman codes
advantageously used in the context of the coder of FIG. 2 are described in more detail in the U.S. patent application
07/844809 (now US-A-5 227 788) by J. D. Johnston and J. Reeds (hereinafter the "entropy coder application") filed
of even date with the present application and assigned to the assignee of this application. Those skilled in the data
communications arts will readily perceive how to implement alternative embodiments of entropy coder 208 using other
noiseless data compression techniques, including the well-known Lempel-Ziv compression methods.
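For orientation, the classical minimum-redundancy construction cited above can be sketched as follows. This is the textbook Huffman algorithm applied to a list of quantized values, not the variation described in the entropy coder application; the function name and the sample data are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    # Build a prefix code in which frequent symbols get shorter codewords.
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial codeword}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, t0 = heapq.heappop(heap)   # two least probable subtrees
        w1, _, t1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t0.items()}
        merged.update({s: "1" + c for s, c in t1.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]

quantized = [0, 0, 0, 0, 1, 1, -1, 2]          # hypothetical quantizer output
code = huffman_code(quantized)
total_bits = sum(len(code[q]) for q in quantized)
```

The encoding is lossless: the decoder recovers the quantized values exactly, so any noise in the reconstructed audio comes from the quantizer alone.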

The use of each of the elements shown in FIG. 2 will be described in greater detail in the context of the overall 
system functionality; details of operation will be provided for the perceptual model processor 204. 


2.1. The Analysis Filter Bank 

The analysis filter bank 202 of the perceptual audio coder 104 receives as input pulse code modulated ("PCM") 
digital audio signals (typically 16-bit signals sampled at 48 kHz), and outputs a representation of the input signal which
identifies the individual frequency components of the input signal. Specifically, an output of the analysis filter bank 202
comprises a Modified Discrete Cosine Transform ("MDCT") of the input signal. See, J. Princen et al, "Sub-band Transform
Coding Using Filter Bank Designs Based on Time Domain Aliasing Cancellation," IEEE ICASSP, pp. 2161-2164
(1987).

An illustrative analysis filter bank 202 is presented in FIG. 4. Analysis filter bank 202 comprises an input signal
buffer 302, a window multiplier 304, a window memory 306, an FFT processor 308, an MDCT processor 310, a concatenator
311, a delay memory 312 and a data selector 314.

The analysis filter bank 202 operates on frames. A frame is conveniently chosen as the 2N PCM input audio signal 
samples held by input signal buffer 302. As stated above, each PCM input audio signal sample is represented by M
bits. Illustratively N = 512 and M = 16. 

Input signal buffer 302 comprises two sections: a first section comprising N samples in buffer locations 1 to N, and 
a second section comprising N samples in buffer locations N+1 to 2N. Each frame to be coded by the perceptual audio 
coder 104 is defined by shifting N consecutive samples of the input audio signal into the input signal buffer 302. Older 
samples are located at higher buffer locations than newer samples.

Assuming that, at a given time, the input signal buffer 302 contains a frame of 2N audio signal samples, the succeeding
frame is obtained by (1) shifting the N audio signal samples in buffer locations 1 to N into buffer locations N+1
to 2N, respectively, (the previous audio signal samples in locations N+1 to 2N may be either overwritten or deleted),
and (2) by shifting into the input signal buffer 302, at buffer locations 1 to N, N new audio signal samples from preprocessor
102. Therefore, it can be seen that consecutive frames contain N samples in common: the first of the consecutive
frames having the common samples in buffer locations 1 to N, and the second of the consecutive frames having the
common samples in buffer locations N+1 to 2N. Analysis filter bank 202 is a critically sampled system (i.e., for every
N audio signal samples received by the input signal buffer 302, the analysis filter bank 202 outputs a vector of N scalars
to the quantizer/rate-loop 206).
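The frame update can be sketched with a plain list, where list index 0 stands for buffer location 1. This is an illustrative reading of the shift described above: it assumes that "older samples at higher locations" applies within each half of the buffer as well, so the newest sample always sits at location 1. The function name is hypothetical.

```python
def next_frame(buffer, new_samples):
    # buffer: current frame of 2N samples, index 0 = buffer location 1 (newest).
    # new_samples: N new samples in time order (oldest first).
    N = len(new_samples)
    if len(buffer) != 2 * N:
        raise ValueError("buffer must hold exactly 2N samples")
    # (1) locations 1..N move to N+1..2N; the old contents of N+1..2N are dropped.
    # (2) the N new samples enter locations 1..N, newest at location 1.
    return list(reversed(new_samples)) + buffer[:N]

frame = [4, 3, 2, 1]                 # N = 2; sample 4 is the newest
following = next_frame(frame, [5, 6])
```

The N samples common to both frames occupy locations 1 to N of `frame` and locations N+1 to 2N of `following`.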

Each frame of the input audio signal is provided to the window multiplier 304 by the input signal buffer 302 so that
the window multiplier 304 may apply seven distinct data windows to the frame.

Each data window is a vector of scalars called "coefficients". While all seven of the data windows have 2N coefficients
(i.e., the same number as there are audio signal samples in the frame), four of the seven only have N/2 non-zero
coefficients (i.e., one-fourth the number of audio signal samples in the frame). As is discussed below, the data window
coefficients may be advantageously chosen to reduce the perceptual entropy of the output of the MDCT processor 310.

The information for the data window coefficients is stored in the window memory 306. The window memory 306 
may illustratively comprise a random access memory ("RAM"), read only memory ("ROM"), or other magnetic or optical 
media. Drawings of seven illustrative data windows, as applied by window multiplier 304, are presented in FIG. 14.
As may be seen in FIG. 14, some of the data window coefficients may be equal to zero. 

Keeping in mind that the data window is a vector of 2N scalars and that the audio signal frame is also a vector of
2N scalars, the data window coefficients are applied to the audio signal frame scalars through point-to-point multiplication
(i.e., the first audio signal frame scalar is multiplied by the first data window coefficient, the second audio signal
frame scalar is multiplied by the second data window coefficient, etc.). Window multiplier 304 may therefore comprise
seven microprocessors operating in parallel, each performing 2N multiplications in order to apply one of the seven
data windows to the audio signal frame held by the input signal buffer 302. The output of the window multiplier 304 is
seven vectors of 2N scalars to be referred to as "windowed frame vectors".
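The point-to-point multiplication can be sketched as below. The window shapes of FIG. 14 are not reproduced in the text, so the two windows used here are placeholders chosen only to show the mechanics, and the function name is hypothetical.

```python
def apply_windows(frame, windows):
    # Multiply the frame element-wise by each data window, producing one
    # "windowed frame vector" per window.
    for window in windows:
        if len(window) != len(frame):
            raise ValueError("each window needs one coefficient per frame sample")
    return [[w * x for w, x in zip(window, frame)] for window in windows]

frame = [1.0, 2.0, 3.0, 4.0]                    # 2N = 4 samples
windows = [
    [1.0, 1.0, 1.0, 1.0],                       # placeholder full-length window
    [0.0, 0.5, 0.5, 0.0],                       # placeholder window with zero coefficients
]
windowed = apply_windows(frame, windows)
```

Each window is independent of the others, which is why the text can assign one processor per window.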

The seven windowed frame vectors are provided by window multiplier 304 to FFT processor 308. The FFT processor
308 performs an odd-frequency FFT on each of the seven windowed frame vectors. The odd-frequency FFT is
a Discrete Fourier Transform evaluated at frequencies:

$$\frac{k f_H}{2N}$$

where k = 1, 3, 5, ..., 2N, and f_H equals one half the sampling rate. The illustrative FFT processor 308 may comprise
seven conventional decimation-in-time FFT processors operating in parallel, each operating on a different windowed
frame vector. An output of the FFT processor 308 is seven vectors of 2N complex elements, to be referred to collectively
as "FFT vectors".
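One common way to obtain a transform at odd multiples of f_H/(2N) is to multiply the input by a half-bin phase ramp and then run a conventional FFT. The sketch below assumes that approach and a frame length that is a power of two; it is not taken from the source, and the function names are hypothetical. Element m of the result corresponds to k = 2m+1.

```python
import cmath

def fft(x):
    # Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two.
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for m in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * m / n) * odd[m]
        out[m] = even[m] + t
        out[m + n // 2] = even[m] - t
    return out

def odd_frequency_fft(frame):
    # frame holds L = 2N samples. The phase ramp exp(-j*pi*n/L) shifts the
    # analysis frequencies by half a bin, so output m is the DFT evaluated
    # at (2m+1) * f_H / L, i.e. at k * f_H / (2N) with k = 2m+1.
    L = len(frame)
    shifted = [x * cmath.exp(-1j * cmath.pi * n / L) for n, x in enumerate(frame)]
    return fft(shifted)
```

A direct evaluation of the sum of x[n] exp(-j pi (2m+1) n / L) over n gives the same values and is a convenient cross-check for short frames.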

FFT processor 308 provides the seven FFT vectors to both the perceptual model processor 204 and the MDCT 
processor 310. The perceptual model processor 204 uses the FFT vectors to direct the operation of the data selector
314 and the quantizer/rate-loop processor 206. Details regarding the operation of data selector 314 and perceptual 
model processor 204 are presented below. 

MDCT processor 310 performs an MDCT based on the real components of each of the seven FFT vectors received
from FFT processor 308. MDCT processor 310 may comprise seven microprocessors operating in parallel. Each such
microprocessor determines one of the seven "MDCT vectors" of N real scalars based on one of the seven respective
FFT vectors. For each FFT vector, F(k), the resulting MDCT vector, X(k), is formed as follows:

$$X(k) = \mathrm{Re}[F(k)] \cos\left[\frac{\pi (2k+1)(1+N)}{4N}\right], \quad 1 \le k \le N.$$

The procedure need run k only to N, not 2N, because of redundancy in the result. To wit, for N<k<2N:

$$X(k) = -X(2N-k).$$

MDCT processor 310 provides the seven MDCT vectors to concatenator 311 and delay memory 312. 

As discussed above with reference to window multiplier 304, four of the seven data windows have N/2 non-zero
coefficients (see Figure 14c-f). This means that four of the windowed frame vectors contain only N/2 non-zero values.
Therefore, the non-zero values of these four vectors may be concatenated into a single vector of length 2N by concatenator
311 upon output from MDCT processor 310. The resulting concatenation of these vectors is handled as a
single vector for subsequent purposes. Thus, delay memory 312 is presented with four MDCT vectors, rather than
seven.

Delay memory 312 receives the four MDCT vectors from MDCT processor 310 and concatenator 311 for the
purpose of providing temporary storage. Delay memory 312 provides a delay of one audio signal frame (as defined by
input signal buffer 302) on the flow of the four MDCT vectors through the filter bank 202. The delay is provided by (i)
storing the two most recent consecutive sets of MDCT vectors representing consecutive audio signal frames and (ii)
presenting as input to data selector 314 the older of the consecutive sets of vectors. Delay memory 312 may comprise
random access memory (RAM) of size:

$$M \times 2 \times 4 \times N$$

where 2 is the number of consecutive sets of vectors, 4 is the number of vectors in a set, N is the number of elements
in an MDCT vector, and M is the number of bits used to represent an MDCT vector element.
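As a worked instance of the size expression, and a sketch of the one-frame delay itself: the values M = 16 and N = 512 below reuse the illustrative figures given earlier for the PCM samples and are an assumption as far as the width of an MDCT vector element is concerned. The class and function names are hypothetical.

```python
from collections import deque

def delay_memory_bits(bits_per_element, vector_length, vectors_per_set=4, sets=2):
    # Size in bits: M x 2 x 4 x N.
    return bits_per_element * sets * vectors_per_set * vector_length

class DelayMemory:
    # Holds the two most recent sets of MDCT vectors and releases the older one.
    def __init__(self):
        self._sets = deque(maxlen=2)

    def push(self, mdct_set):
        # Store the newest set; return the set from the previous frame,
        # or None while only one frame has been seen.
        self._sets.append(mdct_set)
        return self._sets[0] if len(self._sets) == 2 else None

size_bits = delay_memory_bits(16, 512)   # 16 * 2 * 4 * 512 bits
```

With these assumed values the memory holds 65536 bits, i.e. 8192 bytes.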

Data selector 314 selects one of the four MDCT vectors provided by delay memory 312 to be output from the filter
bank 202 to quantizer/rate-loop 206. As mentioned above, the perceptual model processor 204 directs the operation
of data selector 314 based on the FFT vectors provided by the FFT processor 308. Due to the operation of delay
memory 312, the seven FFT vectors provided to the perceptual model processor 204 and the four MDCT vectors
concurrently provided to data selector 314 are not based on the same audio input frame, but rather on two consecutive
input signal frames - the MDCT vectors based on the earlier of the frames, and the FFT vectors based on the later of
the frames. Thus, the selection of a specific MDCT vector is based on information contained in the next successive
audio signal frame. The criteria according to which the perceptual model processor 204 directs the selection of an
MDCT vector are described in Section 2.2, below.

For purposes of an illustrative stereo embodiment, the above analysis filterbank 202 is provided for each of the 
left and right channels. 

2.2. The Perceptual Model Processor 


A perceptual coder achieves success in reducing the number of bits required to accurately represent high quality
audio signals, in part, by introducing noise associated with quantization of information bearing signals, such as the
MDCT information from the filter bank 202. The goal is, of course, to introduce this noise in an imperceptible or benign
way. This noise shaping is primarily a frequency analysis instrument, so it is convenient to convert a signal into a
spectral representation (e.g., the MDCT vectors provided by filter bank 202), compute the shape and amount of the
noise that will be masked by these signals, and inject it by quantizing the spectral values. These and other basic
operations are represented in the structure of the perceptual coder shown in FIG. 2.

The perceptual model processor 204 of the perceptual audio coder 104 illustratively receives its input from the
analysis filter bank 202 which operates on successive frames. The perceptual model processor inputs then typically
comprise seven Fast Fourier Transform (FFT) vectors from the analysis filter bank 202. These are the outputs of the
FFT processor 308 in the form of seven vectors of 2N complex elements, each corresponding to one of the windowed
frame vectors.

In order to mask the quantization noise by the signal, one must consider the spectral contents of the signal and
the duration of a particular spectral pattern of the signal. These two aspects are related to masking in the frequency
domain, where signal and noise are approximately steady state (given the integration period of the hearing system),
and also to masking in the time domain, where signal and noise are subjected to different cochlear filters. The shape
and length of these filters are frequency dependent.

Masking in the frequency domain is described by the concept of simultaneous masking. Masking in the time domain
is characterized by the concepts of premasking and postmasking. These concepts are extensively explained in the
literature; see, for example, E. Zwicker and H. Fastl, "Psychoacoustics: Facts and Models," Springer-Verlag, 1990. To
make these concepts useful to perceptual coding, they are embodied in different ways.

Simultaneous masking is evaluated by using perceptual noise shaping models. Given the spectral contents of the
signal and its description in terms of noise-like or tone-like behavior, these models produce a hypothetical masking
threshold that rules the quantization level of each spectral component. This noise shaping represents the maximum
amount of noise that may be introduced in the original signal without causing any perceptible difference. A measure
called the PERCEPTUAL ENTROPY (PE) uses this hypothetical masking threshold to estimate the theoretical lower
bound of the bitrate for transparent encoding. See J. D. Johnston, "Estimation of Perceptual Entropy Using Noise Masking
Criteria," ICASSP, 1989.

Premasking characterizes the (in)audibility of a noise that starts some time before the masker signal which is 
louder than the noise. The noise amplitude must be more attenuated as the delay increases. This attenuation level is 
also frequency dependent. If the noise is the quantization noise attenuated by the first half of the synthesis window, 
experimental evidence indicates the maximum acceptable delay to be about 1 millisecond. 

This problem is very sensitive and can conflict directly with achieving a good coding gain. Assuming stationary
conditions (which is a false premise), the coding gain is bigger for larger transforms, but the quantization error spreads
back to the beginning of the reconstructed time segment. So, if a transform length of 1024 points is used with a digital
signal sampled at a rate of 48000 Hz, the noise will appear at most 21 milliseconds before the signal. This scenario is
particularly critical when the signal takes the form of a sharp transient in the time domain, commonly known as an
"attack". In this case the quantization noise is audible before the attack. The effect is known as pre-echo.
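
The 21 millisecond figure follows directly from the transform length and the sampling rate. The short Python sketch below reproduces that arithmetic; the function name is hypothetical, and the 256-sample case is included only for comparison with the short window length introduced later in the text.

```python
def segment_duration_ms(window_length: int, sample_rate_hz: float) -> float:
    """Worst-case time span (in ms) over which quantization noise can spread."""
    return 1000.0 * window_length / sample_rate_hz


for length in (1024, 256):
    # At 48000 Hz: 1024 samples last about 21.3 ms, 256 samples about 5.3 ms.
    print(length, round(segment_duration_ms(length, 48000.0), 1))
```

Both values are well above the premasking limit of about 1 millisecond quoted above, which is why a long transform alone cannot hide the noise that precedes an attack.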

Thus, a fixed length filter bank is neither a good perceptual solution nor a good signal processing solution for
non-stationary regions of the signal. It will be shown later that a possible way to circumvent this problem is to improve
the temporal resolution of the coder by reducing the analysis/synthesis window length. This is implemented as a window
switching mechanism when conditions of attack are detected. In this way, the coding gain achieved by using a long
analysis/synthesis window will be affected only when such detection occurs with a consequent need to switch to a shorter
analysis/synthesis window.

Postmasking characterizes the (in)audibility of a noise when it remains after the cessation of a stronger masker
signal. In this case the acceptable delays are on the order of 20 milliseconds. Given that the longest transformed time
segment lasts 21 milliseconds (1024 samples), no special care is needed to handle this situation.


WINDOW SWITCHING 

The PERCEPTUAL ENTROPY (PE) measure of a particular transform segment gives the theoretical lower bound
of bits/sample to code that segment transparently. Due to its memory properties, which are related to premasking
protection, this measure shows a significant increase of the PE value relative to its previous value (that of the previous
segment) when situations of strong non-stationarity of the signal (e.g. an attack) occur. This important
property is used to activate the window switching mechanism in order to reduce pre-echo. This window switching
mechanism is not a new strategy, having been used, e.g., in the ASPEC coder, described in the ISO/MPEG Audio
Coding Report, 1990, but the decision technique behind it is new, using the PE information to accurately localize the
non-stationarity and define the right moment to operate the switch.

Two basic window lengths are used: 1024 samples and 256 samples. The former corresponds to a segment
duration of about 21 milliseconds and the latter to a segment duration of about 5 milliseconds. Short windows are
associated in sets of 4 to represent as much spectral data as a large window (but they represent a "different" number of
temporal samples). In order to make the transition from large to short windows and vice-versa it proves convenient to
use two more types of windows. A START window makes the transition from large (regular) to short windows and a
STOP window makes the opposite transition, as shown in FIG. 5b. See the above-cited Princen reference for useful
information on this subject. Both windows are 1024 samples wide. They are useful to keep the system critically sampled
and also to guarantee the time aliasing cancellation process in the transition region.

In order to exploit interchannel redundancy and irrelevancy, the same type of window is used for RIGHT and LEFT
channels in each segment.

The stationarity behavior of the signal is monitored at two levels: first by large regular windows, then, if necessary,
by short windows. Accordingly, the PE of the large (regular) window is calculated for every segment while the PE of the
short windows is calculated only when needed. However, the tonality information for both types is updated for every
segment in order to follow the continuous variation of the signal.

Unless stated otherwise, a segment involves 1024 samples, which is the length of a large regular window.
The diagram of FIG. 5a represents all the monitoring possibilities when the segment from the point N/2 till the point
3N/2 is being analyzed. Related to the diagram of FIG. 5 is the flowchart of FIG. 6, which describes the monitoring sequence
and decision technique. We need to keep in buffer three halves of a segment in order to be able to insert a START
window prior to a sequence of short windows when necessary. FIGs. 5a-e explicitly consider the 50% overlap between
successive segments.

The process begins by analysing a "new" segment with 512 new temporal samples (the remaining 512 samples
belong to the previous segment). As shown in FIG. 6, the PE of this new segment and the differential PE to the previous
segment are calculated. If the latter value reaches a predefined threshold, then the existence of a non-stationarity
inside the current segment is declared and details are obtained by processing four short windows with positions as
represented in FIG. 5a. The PE value of each short window is calculated, resulting in the ordered sequence: PE1, PE2,
PE3 and PE4. From these values, the exact beginning of the strong non-stationarity of the signal is deduced. Only five
locations are possible. They are identified in FIG. 4a as L1, L2, L3, L4 and L5. As will become evident, if the
non-stationarity had occurred somewhere from the point N/2 till the point 3N/4, that situation would have been detected in the
previous segment. It follows that the PE1 value does not contain relevant information about the stationarity of the current
segment. The average PE of the short windows is compared with the PE of the large window of the same segment. A
smaller PE reveals a more efficient coding situation. Thus if the former value is not smaller than the latter, then we
assume that we are facing a degenerate situation and the window switching process is aborted.

It has been observed that for short windows the information about stationarity lies more in the PE value itself than in
the differential to the PE value of the preceding window. Accordingly, the first window that has a PE value larger than
a predefined threshold is detected. PE2 is identified with location L1, PE3 with L2 and PE4 with location L3. In each of
these cases, a START window is placed before the current segment, which will be coded with short windows. A STOP window
is needed to complete the process. There are, however, two possibilities. If the identified location where the strong
non-stationarity of the signal begins is L1 or L2, then this is well inside the short window sequence, no coding artifacts
result, and the coding sequence is depicted in FIG. 5b. If the location is L3, then, in the worst situation, the non-stationarity
may begin very close to the right edge of the last short window. Previous results have consistently shown that placing
a STOP window (in coding conditions) in these circumstances significantly degrades the reconstruction of the signal
at this switching point. For this reason, another set of four short windows is placed before a STOP window. The resulting
coding sequence is represented in FIG. 5e.

If none of the short PEs is above the threshold, the remaining possibilities are L4 or L5. In this case, the problem 
lies ahead of the scope of the short window sequence and the first segment in the buffer may be immediately coded 
using a regular large window. 

To identify the correct location, another short window must be processed. It is represented in FIG. 5a by a dotted
curve and its PE value, PE1_{n+1}, is also computed. As is easily recognized, this short window already belongs to the
next segment. If PE1_{n+1} is above the threshold, then the location is L4 and, as depicted in FIG. 5c, a START window
may be followed by a STOP window. In this case the spread of the quantization noise will be limited to the length of a
short window, and a better coding gain is achieved. In the rare situation of the location being L5, the coding is
done according to the sequence of FIG. 5d. That this is the right solution in this case can be confirmed by checking
that PE2_{n+1} is above the threshold. PE2_{n+1} is the PE of the short window (not represented in FIG. 5) immediately
following the window identified with PE1_{n+1}.
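
The monitoring and decision sequence described above can be summarised in a short Python sketch. It is an illustration under stated assumptions rather than the coder's actual routine: the function names, the string labels, and the two threshold parameters are hypothetical; the average of the short-window PEs is taken over all four windows; and the location that triggers the sequence of FIG. 5e is read as L3, the one paired with PE4.

```python
from statistics import mean

# Coding sequence (by figure) associated with each possible attack location.
SEQUENCE_FOR_LOCATION = {
    "L1": "FIG. 5b", "L2": "FIG. 5b", "L3": "FIG. 5e",
    "L4": "FIG. 5c", "L5": "FIG. 5d",
}


def locate_attack(pe_short, pe1_next, short_threshold):
    """Map the short-window PE values to one of the locations L1..L5."""
    _pe1, pe2, pe3, pe4 = pe_short  # PE1 carries no new information
    for pe, location in ((pe2, "L1"), (pe3, "L2"), (pe4, "L3")):
        if pe > short_threshold:
            return location  # first short window above the threshold
    # None of PE2..PE4 exceeded the threshold: look into the next segment.
    return "L4" if pe1_next > short_threshold else "L5"


def window_decision(pe_long, pe_long_prev, pe_short, pe1_next,
                    diff_threshold, short_threshold):
    """Return 'long window' or the figure label of the coding sequence."""
    if pe_long - pe_long_prev < diff_threshold:
        return "long window"  # no non-stationarity declared
    if mean(pe_short) >= pe_long:
        return "long window"  # degenerate situation: switching aborted
    location = locate_attack(pe_short, pe1_next, short_threshold)
    return SEQUENCE_FOR_LOCATION[location]


print(window_decision(4.0, 2.0, [1.0, 3.5, 1.0, 1.0], 1.0, 0.5, 3.0))
```

In a real coder the short-window PEs and PE1 of the next segment would be computed lazily, only after the differential PE test succeeds; they are passed in directly here to keep the sketch self-contained.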

As mentioned before, for each segment the RIGHT and LEFT channels use the same type of analysis/synthesis
window. This means that a switch is done for both channels when at least one channel requires it.

It has been observed that for low bit rate applications the solution of FIG. 5c, although representing a good local
psychoacoustic solution, demands an unreasonably large number of bits that may adversely affect the coding quality
of subsequent segments. For this reason, that coding solution may eventually be inhibited.

It is also evident that the details of the reconstructed signal when short windows are used are closer to the original
signal than when only regular large windows are used. This is so because the attack is basically a wide bandwidth signal
and may only be considered stationary for very short periods of time. Since short windows have a greater temporal
resolution than large windows, they are able to follow and reproduce with more fidelity the varying pattern of the
spectrum. In other words, this is the difference between a more precise local (in time) quantization of the signal and a global
(in frequency) quantization of the signal.

The final masking threshold of the stereophonic coder is calculated using a combination of monophonic and
stereophonic thresholds. While the monophonic threshold is computed independently for each channel, the stereophonic
one considers both channels.

The independent masking threshold for the RIGHT or the LEFT channel is computed using a psychoacoustic model
that includes an expression for tone masking noise and noise masking tone. The latter is used as a conservative
approximation for a noise masking noise expression. The monophonic threshold is calculated using the same procedure
as previous work. In particular, a tonality measure considers the evolution of the power and the phase of each frequency
coefficient across the last three segments to identify the signal as being more tone-like or noise-like. Accordingly, each
psychoacoustic expression is more or less weighted than the other. These expressions found in the literature were
updated for better performance. They are defined as:

$$TMN_{dB} = 19.5 + bark\,\frac{18.0}{26.0}$$
$$NMT_{dB} = 6.56 - bark\,\frac{3.06}{26.0}$$

where bark is the frequency in Bark scale. This scale is related to what we may call the cochlear filters or critical
bands which, in turn, are identified with constant length segments of the basilar membrane. The final threshold is
adjusted to consider absolute thresholds of masking and also to consider a partial premasking protection.

A brief description of the complete monophonic threshold calculation follows. Some terminology must be introduced
in order to simplify the description of the operations involved.

The spectrum of each segment is organized in three different ways, each one following a different purpose. 

1. First, it may be organized in partitions. Each partition has a single associated Bark value. These partitions
provide a resolution of approximately either one MDCT line or 1/3 of a critical band, whichever is wider. At low
frequencies a single line of the MDCT will constitute a coder partition. At high frequencies, many lines will be
combined into one coder partition. In this case the Bark value associated is the median Bark point of the partition.
This partitioning of the spectrum is necessary to ensure an acceptable resolution for the spreading function. As will
be shown later, this function represents the masking influence among neighboring critical bands.

2. Secondly, the spectrum may be organized in bands. Bands are defined by a parameter file. Each band groups
a number of spectral lines that are associated with a single scale factor that results from the final masking threshold
vector.

3. Finally, the spectrum may also be organized in sections. It will be shown later that sections involve an integer
number of bands and represent a region of the spectrum coded with the same Huffman code book.






Three indices for data values are used. These are:

$\omega$ → indicates that the calculation is indexed by frequency in the MDCT line domain.

$b$ → indicates that the calculation is indexed in the threshold calculation partition domain. In the case where we
do a convolution or sum in that domain, $bb$ will be used as the summation variable.

$n$ → indicates that the calculation is indexed in the coder band domain.

Additionally some symbols are also used:

1. The index of the calculation partition, $b$.

2. The lowest frequency line in the partition, $\omega low_b$.

3. The highest frequency line in the partition, $\omega high_b$.

4. The median bark value of the partition, $bval_b$.

5. The value for tone masking noise (in dB) for the partition, $TMN_b$.

6. The value for noise masking tone (in dB) for the partition, $NMT_b$.

Several points in the following description refer to the "spreading function". It is calculated by the following method:

$$tmpx = 1.05\,(j - i),$$

where $i$ is the bark value of the signal being spread, $j$ the bark value of the band being spread into, and $tmpx$ is a
temporary variable.

$$x = 8\,\mathrm{minimum}\big((tmpx - .5)^2 - 2(tmpx - .5),\ 0\big)$$

where $x$ is a temporary variable, and minimum(a,b) is a function returning the more negative of a or b.

$$tmpy = 15.811389 + 7.5\,(tmpx + .474) - 17.5\,\big(1. + (tmpx + .474)^2\big)^{.5}$$

where $tmpy$ is another temporary variable.

if ($tmpy < -100$) then {$sprdngf(i,j) = 0$} else {$sprdngf(i,j) = 10^{(x + tmpy)/10}$}.
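
The spreading function lends itself to a direct transcription. The following Python sketch follows the method above step by step, assuming the exponent of the final expression is (x + tmpy)/10 and the inner term of tmpy is raised to the power 0.5; the function name simply mirrors the symbol used in the text.

```python
import math


def sprdngf(i: float, j: float) -> float:
    """Spreading from a signal at bark value i into a band at bark value j."""
    tmpx = 1.05 * (j - i)
    # minimum(a, b) returns the more negative of its arguments.
    x = 8.0 * min((tmpx - 0.5) ** 2 - 2.0 * (tmpx - 0.5), 0.0)
    tmpy = (15.811389 + 7.5 * (tmpx + 0.474)
            - 17.5 * math.sqrt(1.0 + (tmpx + 0.474) ** 2))
    if tmpy < -100.0:
        return 0.0
    return 10.0 ** ((x + tmpy) / 10.0)


# Spreading is strongest within the same bark value and decays with distance.
for j in (3.0, 5.0, 7.0):
    print(j, sprdngf(5.0, j))
```

Because the constants are chosen so that tmpy is essentially zero when i equals j, the function is close to unity on the diagonal and falls off asymmetrically on either side.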

Steps In Threshold Calculation 

The following steps are necessary for calculating the $SMR_n$ used in the coder.

1. Concatenate 512 new samples of the input signal to form another 1024-sample segment. Please refer to FIG. 5a.

2. Calculate the complex spectrum of the input signal using the O-FFT as described in 2.0 and using a sine window.

3. Calculate a predicted $r$ and $\phi$.

The polar representation of the transform is calculated. $r_\omega$ and $\phi_\omega$ represent the magnitude and phase
components of a spectral line of the transformed segment.

A predicted magnitude, $\hat{r}_\omega$, and phase, $\hat{\phi}_\omega$, are calculated from the preceding two threshold
calculation blocks' $r$ and $\phi$:

$$\hat{r}_\omega = 2\,r_\omega(t-1) - r_\omega(t-2)$$
$$\hat{\phi}_\omega = 2\,\phi_\omega(t-1) - \phi_\omega(t-2)$$

where $t$ represents the current block number, $t-1$ indexes the previous block's data, and $t-2$ indexes the data from
the threshold calculation block before that.
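4. Calculate the unpredictability measure $c_\omega$.
$c_\omega$, the unpredictability measure, is:

$$c_\omega = \frac{\sqrt{(r_\omega\cos\phi_\omega - \hat{r}_\omega\cos\hat{\phi}_\omega)^2 + (r_\omega\sin\phi_\omega - \hat{r}_\omega\sin\hat{\phi}_\omega)^2}}{r_\omega + \mathrm{abs}(\hat{r}_\omega)}$$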

5. Calculate the energy and unpredictability in the threshold calculation partitions.
The energy in each partition, $e_b$, is:

$$e_b = \sum_{\omega=\omega low_b}^{\omega high_b} r_\omega^2$$

and the weighted unpredictability, $c_b$, is:

$$c_b = \sum_{\omega=\omega low_b}^{\omega high_b} r_\omega^2\,c_\omega$$

6. Convolve the partitioned energy and unpredictability with the spreading function.

$$ecb_b = \sum_{bb=1}^{bmax} e_{bb}\,sprdngf(bval_{bb}, bval_b)$$

$$ct_b = \sum_{bb=1}^{bmax} c_{bb}\,sprdngf(bval_{bb}, bval_b)$$

Because $ct_b$ is weighted by the signal energy, it must be renormalized to $cb_b$.

$$cb_b = \frac{ct_b}{ecb_b}$$

At the same time, due to the non-normalized nature of the spreading function, $ecb_b$ should be renormalized and
the normalized energy $en_b$ calculated.

$$en_b = \frac{ecb_b}{rnorm_b}$$

The normalization coefficient, $rnorm_b$, is:

$$rnorm_b = \sum_{bb=0}^{bmax} sprdngf(bval_{bb}, bval_b)$$

7. Convert $cb_b$ to $tb_b$.

$$tb_b = -.299 - .43\,\log_e(cb_b)$$

Each $tb_b$ is limited to the range of $0 < tb_b < 1$.
8. Calculate the required SNR in each partition.

$$TMN_b = 19.5 + bval_b\,\frac{18.0}{26.0}$$

$$NMT_b = 6.56 - bval_b\,\frac{3.06}{26.0}$$

where $TMN_b$ is the tone masking noise in dB and $NMT_b$ is the noise masking tone value in dB.

The required signal to noise ratio, $SNR_b$, is:

$$SNR_b = tb_b\,TMN_b + (1 - tb_b)\,NMT_b$$

9. Calculate the power ratio.
The power ratio, $bc_b$, is:

$$bc_b = 10^{-SNR_b/10}$$

10. Calculate the actual energy threshold, $nb_b$.

$$nb_b = en_b\,bc_b$$

11. Spread the threshold energy over MDCT lines, yielding $nb_\omega$.

$$nb_\omega = \frac{nb_b}{\omega high_b - \omega low_b + 1}$$

12. Include absolute thresholds, yielding the final energy threshold of audibility, $thr_\omega$.

$$thr_\omega = \max(nb_\omega, absthr_\omega)$$

The dB values must be converted into the energy domain after considering the MDCT normalization actually used.



13. Pre-echo control.

14. Calculate the signal to mask ratios, $SMR_n$.
The table of "Bands of the Coder" shows:

1. The index, $n$, of the band.

2. The upper index, $\omega high_n$, of the band $n$. The lower index, $\omega low_n$, is computed from the previous band as
$\omega high_{n-1} + 1$.

To further classify each band, another variable is created. The width index, $width_n$, will assume a value $width_n = 1$
if $n$ is a perceptually narrow band, and $width_n = 0$ if $n$ is a perceptually wide band. The former case occurs if

$$bval_{\omega high_b} - bval_{\omega low_b} < bandlength$$

bandlength is a parameter set in the initialization routine. Otherwise the latter case is assumed.
Then, if ($width_n = 1$), the noise level in the coder band, $nband_n$, is calculated as:

$$nband_n = \frac{\sum_{\omega=\omega low_n}^{\omega high_n} thr_\omega}{\omega high_n - \omega low_n + 1}$$

else,

$$nband_n = \mathrm{minimum}(thr_{\omega low_n}, \ldots, thr_{\omega high_n})$$

where, in this case, minimum(a, ..., z) is a function returning the most negative or smallest positive argument of the
arguments a...z.

The ratios to be sent to the decoder, $SMR_n$, are calculated as:

$$SMR_n = 10\,\log_{10}\!\left(\frac{[12.0 \cdot nband_n]^{0.5}}{\mathrm{minimum}(absthr)}\right)$$

It is important to emphasize that since the tonality measure is the output of a spectrum analysis process, the
analysis window has a sine form for all the cases of large or short segments. In particular, when a segment is chosen
to be coded as a START or STOP window, its tonality information is obtained considering a sine window; the remaining
operations, e.g. the threshold calculation and the quantization of the coefficients, consider the spectrum obtained with
the appropriate window.
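
To make the per-line and per-partition arithmetic of steps 3, 4 and 7 to 10 concrete, the following Python sketch implements them as small functions. It is a hedged illustration, not the coder's code: the function names are hypothetical, the unpredictability is taken as the Euclidean distance between the actual and predicted spectral line divided by the sum of their magnitudes (an assumption about the exact form of the equation in step 4), the range limit on the tonality index is implemented as clamping to [0, 1], and the TMN and NMT values are passed in rather than computed.

```python
import math


def predict(prev1: float, prev2: float) -> float:
    """Step 3: linear extrapolation from blocks t-1 and t-2."""
    return 2.0 * prev1 - prev2


def unpredictability(r, phi, r_hat, phi_hat):
    """Step 4: normalised distance between actual and predicted line."""
    num = math.hypot(r * math.cos(phi) - r_hat * math.cos(phi_hat),
                     r * math.sin(phi) - r_hat * math.sin(phi_hat))
    den = r + abs(r_hat)
    return num / den if den > 0.0 else 0.0  # guard added for this sketch


def tonality(cb: float) -> float:
    """Step 7: map normalised unpredictability to a tonality index."""
    if cb <= 0.0:
        return 1.0  # assumption: perfectly predictable means fully tonal
    tb = -0.299 - 0.43 * math.log(cb)
    return min(1.0, max(0.0, tb))


def partition_threshold(en, tb, tmn_db, nmt_db):
    """Steps 8-10: required SNR, power ratio, and energy threshold."""
    snr_db = tb * tmn_db + (1.0 - tb) * nmt_db
    bc = 10.0 ** (-snr_db / 10.0)
    return en * bc


# A tonal partition (tb = 1) with 20 dB of required SNR sits 20 dB below en.
print(partition_threshold(100.0, tonality(0.01), 20.0, 6.0))
```

Keeping the steps as separate functions mirrors the structure of the description, so each one can be checked in isolation against the corresponding equation.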

STEREOPHONIC THRESHOLD 



The stereophonic threshold has several goals. It is known that most of the time the two channels sound "alike".
Thus, some correlation exists that may be converted into coding gain. Looking into the temporal representation of the
two channels, this correlation is not obvious. However, the spectral representation has a number of interesting features
that may advantageously be exploited. In fact, a very practical and useful possibility is to create a new basis to represent
the two channels. This basis involves two orthogonal vectors, the vector SUM and the vector DIFFERENCE, defined
by the following linear combination:

$$\begin{bmatrix} SUM \\ DIF \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} RIGHT \\ LEFT \end{bmatrix}$$

These vectors, which have the length of the window being used, are generated in the frequency domain since the
transform process is by definition a linear operation. This has the advantage of simplifying the computational load.
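
The change of basis is simple enough to show in a few lines of Python. The sketch below assumes the 1/2-scaled sum and difference of the two channel spectra, with the difference taken as RIGHT minus LEFT; the ordering of the channels and the function names are assumptions made for illustration.

```python
def to_sum_diff(right, left):
    """Return (SUM, DIF) spectra from RIGHT and LEFT spectra."""
    s = [0.5 * (r + l) for r, l in zip(right, left)]
    d = [0.5 * (r - l) for r, l in zip(right, left)]
    return s, d


def from_sum_diff(s, d):
    """Invert the transform: RIGHT = SUM + DIF, LEFT = SUM - DIF."""
    right = [a + b for a, b in zip(s, d)]
    left = [a - b for a, b in zip(s, d)]
    return right, left


s, d = to_sum_diff([1.0, 2.0], [1.0, 0.0])
print(s, d)                 # identical lines put all their energy in SUM
print(from_sum_diff(s, d))  # the original channels are recovered
```

Since the mapping is linear and invertible, it can be applied line by line to the transform coefficients without any additional transform.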

The first goal is to have a more decorrelated representation of the two signals. The concentration of most of the
energy in one of these new channels is a consequence of the redundancy that exists between RIGHT and LEFT
channels and, on average, always leads to a coding gain.

A second goal is to correlate the quantization noise of the RIGHT and LEFT channels and control the localization
of the noise or the unmasking effect. This problem arises if RIGHT and LEFT channels are quantized and coded
independently. This concept is exemplified by the following context: supposing that the threshold of masking for a particular
signal has been calculated, two situations may be created. First we add to the signal an amount of noise that
corresponds to the threshold. If we present this same signal with this same noise to the two ears then the noise is masked.
However, if we add an amount of noise that corresponds to the threshold to the signal and present this combination
to one ear, and do the same operation for the other ear but with noise uncorrelated with the previous one, then the noise
is not masked. In order to achieve masking again, the noise at both ears must be reduced by a level given by the
masking level differences (MLD).

The unmasking problem may be generalized to the following form: the quantization noise is not masked if it does
not follow the localization of the masking signal. Hence, in particular, we may have two limit cases: center localization
of the signal with unmasking more noticeable on the sides of the listener and side localization of the signal with
unmasking more noticeable on the center line.

The new vectors SUM and DIFFERENCE are very convenient because they express the signal localized on the
center and also on both sides of the listener. Also, they make it possible to control the quantization noise with center and
side image. Thus, the unmasking problem is solved by controlling the protection level for the MLD through these vectors.
Based on some psychoacoustic information and other experiments and results, the MLD protection is particularly critical
for very low frequencies up to about 3 kHz. It appears to depend only on the signal power and not on its tonality properties.
The following expression for the MLD proved to give good results:



$$MLD_{dB}(i) = 25.5\left[\cos\frac{\pi\,b(i)}{32.0}\right]^2$$

where $i$ is the partition index of the spectrum (see [7]), and $b(i)$ is the bark frequency of the center of the partition $i$. This
expression is only valid for $b(i) < 16.0$, i.e., for frequencies below 3 kHz. The expression for the MLD threshold is given by:

$$THR_{MLD}(i) = C(i)\,10^{-MLD_{dB}(i)/10}$$

$C(i)$ is the spread signal energy on the basilar membrane, corresponding only to the partition $i$.
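
A minimal Python sketch of the MLD protection follows. It assumes the cosine-squared form with an argument of pi times b(i) over 32.0, so that the protection falls to zero at 16 Bark, and it treats the MLD as zero above that point; both the function names and the treatment outside the stated range of validity are assumptions of this sketch.

```python
import math


def mld_db(bark: float) -> float:
    """Masking level difference in dB for a partition centred at `bark`."""
    if bark >= 16.0:
        return 0.0  # assumption: no MLD protection outside the valid range
    return 25.5 * math.cos(math.pi * bark / 32.0) ** 2


def thr_mld(spread_energy: float, bark: float) -> float:
    """MLD threshold: spread signal energy lowered by the MLD."""
    return spread_energy * 10.0 ** (-mld_db(bark) / 10.0)


for b in (0.0, 8.0, 16.0):
    print(b, round(mld_db(b), 2), thr_mld(1.0, b))
```

The protection is largest at the lowest frequencies, where the threshold is pushed about 25.5 dB below the spread signal energy, and vanishes as the partition approaches 16 Bark.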

A third and last goal is to take advantage of a particular stereophonic signal image to extract irrelevance from 
directions of the signal that are masked by that image. In principle, this is done only when the stereo image is strongly 
defined in one direction, in order to not compromise the richness of the stereo signal. Based on the vectors SUM and 
DIFFERENCE, this goal is implemented by postulating the following two dual principles: 

1. If there is a strong depression of the signal (and hence of the noise) on both sides of the listener, then an increase
of the noise on the middle line (center image) is perceptually tolerated. The upper bound is the side noise.

2. If there is a strong localization of the signal (and hence of the noise) on the middle line, then an increase of the
(correlated) noise on both sides is perceptually tolerated. The upper bound is the center noise.



However, any increase of the noise level must be corrected by the MLD threshold. 

According to these goals, the final stereophonic threshold is computed as follows. First, the thresholds for channels
SUM and DIFFERENCE are calculated using the monophonic models for noise-masking-tone and tone-masking-noise.
The procedure is exactly the one presented in the section entitled Steps In Threshold Calculation, up to step 10. At this
point we have the actual energy threshold per band, $nb_b$, for both channels. For convenience, we call them $THRn_{SUM}$
and $THRn_{DIF}$, respectively for the channel SUM and the channel DIFFERENCE.

Secondly, the MLD thresholds for both channels, i.e. $THRn_{MLD,SUM}$ and $THRn_{MLD,DIF}$, are also calculated by:

$$THRn_{MLD,SUM} = en_{b,SUM}\,10^{-MLDn_{dB}/10}$$

$$THRn_{MLD,DIF} = en_{b,DIF}\,10^{-MLDn_{dB}/10}$$

The MLD protection and the stereo irrelevance are considered by computing:

$$nthr_{SUM} = MAX[THRn_{SUM},\ MIN(THRn_{DIF},\ THRn_{MLD,DIF})]$$

$$nthr_{DIF} = MAX[THRn_{DIF},\ MIN(THRn_{SUM},\ THRn_{MLD,SUM})]$$
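
The two MAX/MIN expressions can be written as a small Python function. This is a sketch for a single band with hypothetical argument names; the numeric values in the example are arbitrary energies chosen only to exercise both branches.

```python
def stereo_thresholds(thr_sum, thr_dif, thr_mld_sum, thr_mld_dif):
    """Combine monophonic and MLD thresholds for the SUM and DIF channels."""
    # Noise in one channel may rise to the other channel's threshold,
    # but never beyond what the MLD protection of that other channel allows.
    nthr_sum = max(thr_sum, min(thr_dif, thr_mld_dif))
    nthr_dif = max(thr_dif, min(thr_sum, thr_mld_sum))
    return nthr_sum, nthr_dif


# Arbitrary example energies: the SUM threshold is raised from 1.0 to 4.0,
# while the DIF threshold stays at its monophonic value of 10.0.
print(stereo_thresholds(1.0, 10.0, 0.5, 4.0))
```

The outer maximum guarantees that the stereophonic threshold is never lower than the monophonic one, so the stereo irrelevance can only relax the quantization, within the bound set by the MLD.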



After these operations, the remaining steps after the 11th, as presented previously, are also taken for both channels.
In essence, these last thresholds are further adjusted to consider the absolute threshold and also a partial premasking
protection. It must be noticed that this premasking protection was simply adopted from the monophonic case. It
considers a monaural time resolution of about 2 milliseconds. However, the binaural time resolution is as accurate as 6
microseconds! How to conveniently code stereo signals with a relevant stereo image based on interchannel time differences
is a subject that needs further investigation. FIG. 8 illustrates data relating hereto.


STEREOPHONIC CODER 

The simplified flowchart of the stereophonic coder is presented in FIG. 7. For each segment of data being analyzed,
detailed information about the independent and relative behavior of both signal channels may be available through the
information given by large and short transforms. This information is used according to the necessary number of steps
needed to code a particular segment. These steps involve essentially the selection of the analysis window, the definition
on a band basis of the coding mode (R/L or S/D), the quantization and Huffman coding of the coefficients and scale
factors and, finally, the bitstream composing.

Coding Mode Selection

When a new segment is read, the tonality updating for large and short analysis windows is done. Monophonic
thresholds and the PE values are calculated according to the technique described above in the section entitled Steps
In Threshold Calculation, and elsewhere. This gives the first decision about the type of window to be used for both
channels.

Once the window sequence is chosen, an orthogonal coding decision is then considered. It involves the choice
between independent coding of the channels, mode RIGHT/LEFT (R/L), or joint coding using the SUM and DIFFERENCE
channels (S/D). This decision is taken on a band basis of the coder. This is based on the assumption that the
binaural perception is a function of the output of the same critical bands at the two ears. If the threshold at the two
channels is very different, then there is no need for MLD protection and the signals will not be more decorrelated if the
channels SUM and DIFFERENCE are considered. If the signals are such that they generate a stereo image, then an
MLD protection must be activated and additional gains may be exploited by choosing the S/D coding mode. A convenient
way to detect this latter situation is by comparing the monophonic threshold between RIGHT and LEFT channels.
If the thresholds in a particular band do not differ by more than a predefined value, e.g. 2 dB, then the S/D coding mode
is chosen. Otherwise the independent mode R/L is assumed. Associated with each band is a one bit flag that specifies
the coding mode of that band and that must be transmitted to the decoder as side information. From now on it is
called a coding mode flag.
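
The per-band decision can be sketched in Python as follows. The sketch assumes the monophonic thresholds are available as positive energies per coder band, so that their difference in dB is ten times the base-10 logarithm of their ratio; the function name, the string flags, and the default of 2 dB are illustrative choices.

```python
import math


def coding_mode_flags(thr_right, thr_left, max_diff_db=2.0):
    """Return 'S/D' or 'R/L' for each coder band."""
    flags = []
    for tr, tl in zip(thr_right, thr_left):
        # Difference between the two monophonic thresholds, in dB.
        diff_db = abs(10.0 * math.log10(tr / tl))
        flags.append("S/D" if diff_db <= max_diff_db else "R/L")
    return flags


# Band 0: thresholds within about 0.8 dB; band 1: thresholds 10 dB apart.
print(coding_mode_flags([1.0, 1.0], [1.2, 10.0]))
```

In the bitstream each flag would occupy a single bit per band; strings are used here only to keep the output readable.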

The coding mode decision is adaptive in time since for the same band it may differ for subsequent segments, and
is also adaptive in frequency since for the same segment, the coding mode for subsequent bands may be different.
An illustration of a coding decision is given in FIG. 13. This illustration is valid for long and also short segments.

At this point it is clear that since the window switching mechanism involves only monophonic measures, the
maximum number of PE measures per segment is 10 (2 channels * [1 large window + 4 short windows]). However, the
maximum number of thresholds that we may need to compute per segment is 20 and therefore 20 tonality measures
must always be updated per segment (4 channels * [1 large window + 4 short windows]).

Bit rate Adjustment 

It was previously said that the decisions for window switching and for coding mode selection are orthogonal in the
sense that they do not depend on each other. Also independent of these decisions is the final step of the coding process,
which involves quantization, Huffman coding and bitstream composing; i.e. there is no feedback path. This fact has the
advantage of reducing the whole coding delay to a minimum value (1024/48000 = 21.3 milliseconds) and also of avoiding
instabilities due to unorthodox coding situations.

The quantization process affects both spectral coefficients and scale factors. Spectral coefficients are clustered
in bands, each band having the same step size or scale factor. Each step size is directly computed from the masking
threshold corresponding to its band. The quantized values, which are integer numbers, are then converted to variable
word length or Huffman codes. The total number of bits to code the segment, considering additional fields of the
bitstream, is computed. Since the bitrate must be kept constant, the quantization process must be iteratively done till that
number of bits is within predefined limits. After the number of bits needed to code the whole segment is computed, considering the
basic masking threshold, the degree of adjustment is dictated by a buffer control unit. This control unit shares the deficit
or credit of additional bits among several segments, according to the needs of each one.

The technique of the bitrate adjustment routine is represented by the flowchart of FIG. 9. It may be seen that after
the total number of available bits to be used by the current segment is computed, an iterative procedure tries to find a
factor α such that if all the initial thresholds are multiplied by this factor, the final total number of bits is smaller than
and within an error δ of the available number of bits. Even if the approximation curve is so hostile that α is not found
within the maximum number of iterations, one acceptable solution is always available.

The main steps of this routine are depicted in figures 7 and 9 as follows. First, an interval including the solution is 
found. Then, a loop seeks to rapidly converge to the solution. At each iteration, the best solution is updated. 
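
The two steps (bracketing, then rapid convergence while keeping the best solution) can be sketched as below. This is only an illustration under stated assumptions: the doubling/halving bracket, the bisection update, the `count_bits` callback and the toy bit-count model are all hypothetical and stand in for the routine of FIGS. 7 and 9, whose details are not reproduced here. The sketch also assumes that the bit count does not increase when the thresholds are raised.

```python
from typing import Callable, Tuple

def adjust_bitrate(count_bits: Callable[[float], int],
                   available_bits: int,
                   delta: int,
                   max_iter: int = 20) -> Tuple[float, int]:
    """Search for a threshold multiplier whose bit count fits the budget.

    count_bits(alpha) is assumed to be non-increasing in alpha (hypothetical).
    Returns the best (alpha, bits) pair seen so far.
    """
    # Step 1: find an interval [lo, hi] that includes the solution.
    lo, hi = 1.0, 1.0
    while count_bits(hi) > available_bits and hi < 2.0 ** 40:
        lo, hi = hi, 2.0 * hi      # too many bits: raise the thresholds
    while count_bits(lo) <= available_bits and lo > 2.0 ** -40:
        hi, lo = lo, 0.5 * lo      # bits to spare: lower the thresholds
    # hi fits the budget unless the cap above was hit; keep it as fallback.
    best_alpha, best_bits = hi, count_bits(hi)

    # Step 2: bisect, updating the best solution at each iteration.
    for _ in range(max_iter):
        if available_bits - best_bits <= delta:
            break                  # within the error delta of the budget
        mid = 0.5 * (lo + hi)
        bits = count_bits(mid)
        if bits <= available_bits:
            hi = mid
            best_alpha, best_bits = mid, bits
        else:
            lo = mid
    return best_alpha, best_bits

# Toy bit-count model (hypothetical): bits fall as the thresholds are raised.
alpha, bits = adjust_bitrate(lambda a: int(1000 / a), available_bits=400, delta=10)
```

Keeping the last multiplier that fits the budget is what makes a usable result available even when the loop runs out of iterations.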

In order to use the same procedure for segments coded with large and short windows, in this latter case the
coefficients of the 4 short windows are clustered by concatenating homologue bands. Scale factors are clustered in
the same way.
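
One possible reading of this clustering is sketched below: for each band, the coefficients of that band are taken from every short window in turn and concatenated. The band-edge representation and the window-major ordering inside a cluster are assumptions of this sketch, not details given in the text.

```python
from typing import List, Sequence

def cluster_short_windows(windows: Sequence[Sequence[float]],
                          band_edges: Sequence[int]) -> List[List[float]]:
    """Concatenate homologue bands of several short windows.

    band_edges[i]:band_edges[i + 1] delimits band i inside each window
    (hypothetical layout).
    """
    clustered = []
    for start, stop in zip(band_edges[:-1], band_edges[1:]):
        band = []
        for coeffs in windows:
            # Append this window's share of the current band.
            band.extend(coeffs[start:stop])
        clustered.append(band)
    return clustered

# Four short windows of four coefficients each, split into two bands.
windows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = cluster_short_windows(windows, band_edges=[0, 1, 4])
```

After clustering, each band can be handed to the same bitrate adjustment routine that is used for a large window.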

The bitrate adjustment routine calls another routine that computes the total number of bits to represent all the
Huffman coded words (coefficients and scale factors). This latter routine does a spectrum partitioning according to the
amplitude distribution of the coefficients. The goal is to assign predefined Huffman code books to sections of the
spectrum. Each section groups a variable number of bands and its coefficients are Huffman coded with a convenient
book. The limits of the section and the reference of the code book must be sent to the decoder as side information.
See Fig. 11 in this connection.

The spectrum partitioning is done using a minimum cost strategy. The main steps are as follows. First, all possible
sections are defined (the limit is one section per band), each one having the code book that best matches the amplitude
distribution of the coefficients within that section. As the beginning and the end of the whole spectrum are known, if K
is the number of sections, there are K-1 separators between sections. The price to eliminate each separator is computed.
The separator that has the lowest price is eliminated (initial prices may be negative). Prices are computed again
before the next iteration. This process is repeated till a maximum allowable number of sections is obtained and the
smallest price to eliminate another separator is higher than a predefined value. See Fig. 10 in this connection.
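
The greedy separator elimination can be sketched as follows. The code books, the per-section overhead and the cost model are hypothetical placeholders chosen only to make the example runnable; the price of a separator is taken here to be the bit cost of the merged section minus the cost of the two sections it separates, which is one plausible reading of the text.

```python
from typing import List, Tuple

# Hypothetical code books: (largest magnitude covered, bits per coefficient).
BOOKS = [(0, 0), (1, 2), (3, 4), (15, 7)]
SECTION_OVERHEAD_BITS = 9  # hypothetical side information per section

def section_cost(max_abs: int, n_coeffs: int) -> int:
    """Bits for one section coded with the cheapest book that covers it."""
    for limit, bits_per_coeff in BOOKS:
        if max_abs <= limit:
            return SECTION_OVERHEAD_BITS + n_coeffs * bits_per_coeff
    raise ValueError("amplitude exceeds every code book")

def partition(bands: List[List[int]], max_sections: int,
              price_limit: int) -> List[Tuple[int, int]]:
    """Merge neighbouring sections greedily; return (first_band, last_band) pairs."""
    # Start with one section per band: [first band, last band, max |x|, size].
    secs = [[i, i, max((abs(x) for x in b), default=0), len(b)]
            for i, b in enumerate(bands)]
    while len(secs) > 1:
        # Price of each separator: merged cost minus the two separate costs.
        prices = []
        for a, b in zip(secs[:-1], secs[1:]):
            merged = section_cost(max(a[2], b[2]), a[3] + b[3])
            prices.append(merged - section_cost(a[2], a[3])
                          - section_cost(b[2], b[3]))
        k = min(range(len(prices)), key=prices.__getitem__)
        # Stop once few enough sections remain and no merge is cheap enough.
        if len(secs) <= max_sections and prices[k] > price_limit:
            break
        a, b = secs[k], secs[k + 1]
        secs[k:k + 2] = [[a[0], b[1], max(a[2], b[2]), a[3] + b[3]]]
    return [(s[0], s[1]) for s in secs]

# Five bands of two quantized coefficients each (toy data).
bands = [[9, -12], [7, 8], [1, 0], [0, 1], [0, 0]]
sections = partition(bands, max_sections=3, price_limit=0)
```

With these toy numbers the early merges have negative prices because they save section overhead, which matches the remark that initial prices may be negative.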

Aspects of the processing accomplished by quantizer/rate-loop 206 in FIG. 2 will now be presented. In the prior
art, rate-loop mechanisms have contained assumptions related to the monophonic case. With the shift from monophonic
to stereophonic perceptual coders, the demands placed upon the rate-loop are increased.

The inputs to quantizer/rate-loop 206 in FIG. 2 comprise spectral coefficients (i.e., the MDCT coefficients) derived
by analysis filter bank 202, and outputs of perceptual model 204, including calculated thresholds corresponding to the 
spectral coefficients. 

Quantizer/rate-loop 206 quantizes the spectral information based, in part, on the calculated thresholds and the
absolute thresholds of hearing and in doing so provides a bitstream to entropy coder 208. The bitstream includes
signals divided into three parts: (1) a first part containing the standardized side information; (2) a second part containing
the scaling factors for the 35 or 56 bands and additional side information used for so-called adaptive-window switching,
when used (the length of this part can vary depending on information in the first part); and (3) a third part comprising
the quantized spectral coefficients.
A "utilized scale factor", $\Delta$, is iteratively derived by interpolating between a calculated scale factor and a scale
factor derived from the absolute threshold of hearing at the frequency corresponding to the frequency of the respective
spectral coefficient to be quantized until the quantized spectral coefficients can be encoded within permissible limits.

An illustrative embodiment can be seen in FIG. 13. As shown at 1301, quantizer/rate-loop receives a spectral
coefficient, $C_f$, and an energy threshold, $E$, corresponding to that spectral coefficient. As shown at 1303, a "threshold
scale factor", $\Delta_0$, is calculated by

$$\Delta_0 = \sqrt{12E}$$
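
The factor 12 in the threshold scale factor is consistent with the standard model of uniform quantization noise. As a brief worked step, assuming that $E$ is read as the permissible noise energy for the coefficient and that the quantization error $e$ is uniformly distributed over one step of width $\Delta_0$:

```latex
\sigma_e^2 \;=\; \frac{1}{\Delta_0}\int_{-\Delta_0/2}^{\Delta_0/2} e^2 \, de
\;=\; \frac{\Delta_0^2}{12} \;=\; E
\quad\Longrightarrow\quad
\Delta_0 = \sqrt{12E}
```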

An "absolute scale factor", $\Delta_A$, is also calculated based upon the absolute threshold of hearing (i.e., the quietest sound
that can be heard at the frequency corresponding to the scale factor). Advantageously, an interpolation constant, $\alpha$,
and interpolation bounds $\alpha_{high}$ and $\alpha_{low}$ are initialized to aid in the adjustment of the utilized scale factor:

$$\alpha_{high} = 1$$

$$\alpha_{low} = 0$$

$$\alpha = \alpha_{high}$$

Next, as shown in 1305, the utilized scale factor is determined from:

$$\Delta = \Delta_0^{\alpha} \times \Delta_A^{(1-\alpha)}$$
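
Reading the utilized scale factor as the threshold scale factor $\Delta_0$ raised to the power $\alpha$ times the absolute scale factor $\Delta_A$ raised to the power $1-\alpha$, the interpolation is geometric, that is, linear in the logarithm of the step size. Taking logarithms makes the role of the interpolation constant $\alpha$ explicit:

```latex
\log \Delta \;=\; \alpha \,\log \Delta_0 \;+\; (1-\alpha)\,\log \Delta_A ,
\qquad
\alpha = 1 \;\Rightarrow\; \Delta = \Delta_0 ,
\qquad
\alpha = 0 \;\Rightarrow\; \Delta = \Delta_A
```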

Next, as shown in 1307, the utilized scale factor is itself quantized because the utilized scale factor as computed
above is not discrete but is advantageously discrete when transmitted and used.

$$\Delta = Q^{-1}(Q(\Delta))$$

Next, as shown in 1309, the spectral coefficient is quantized using the utilized scale factor to create a "quantized
spectral coefficient" $Q(C_f, \Delta)$:

$$Q(C_f, \Delta) = \mathrm{NINT}\left(\frac{C_f}{\Delta}\right)$$

where "NINT" is the nearest integer function. Because quantizer/rate-loop 206 must transmit both the quantized spectral
coefficient and the utilized scale factor, a cost, $C$, is calculated which is associated with how many bits it will take to
transmit them both. The cost $C$ is calculated as:

$$C = \mathrm{FOO}(Q(C_f, \Delta), Q(\Delta))$$

where FOO is a function which, depending on the specific embodiment, can be easily determined by persons having
ordinary skill in the art of data communications. As shown in 1313, the cost, $C$, is tested to determine whether it is in a
permissible range PR. When the cost is within the permissible range, $Q(C_f, \Delta)$ and $Q(\Delta)$ are transmitted to entropy coder
208.

Advantageously, and depending on the relationship of the cost $C$ to the permissible range PR, the interpolation
constant and bounds are adjusted until the utilized scale factor yields a quantized spectral coefficient which has a cost
within the permissible range. Illustratively, as shown in FIG. 13 at 1315, the interpolation bounds are manipulated to
produce a binary search. Specifically, when $C > PR$,

$$\alpha_{high} = \alpha,$$

alternately, when $C < PR$,

$$\alpha_{low} = \alpha.$$

In either case, the interpolation constant is calculated by:

$$\alpha = \frac{\alpha_{low} + \alpha_{high}}{2}$$

The process then continues at 1305 iteratively until the cost $C$ comes within the permissible range PR.
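
The loop of FIG. 13 can be sketched for a single coefficient as below. This is a hedged illustration only: the cost function standing in for FOO, the representation of the permissible range PR as a pair of bounds, the iteration cap and the toy numbers are all assumptions, and the quantization of the scale factor itself (step 1307) is omitted for brevity. The update rule follows the text, so the search converges only if the cost does not decrease as the interpolation constant grows; the toy values are chosen so that this holds.

```python
import math
from typing import Callable, Tuple

def rate_loop(c_f: float, energy: float, delta_abs: float,
              cost_fn: Callable[[int, float], float],
              pr_low: float, pr_high: float,
              max_iter: int = 32) -> Tuple[int, float, float]:
    """Binary search on the interpolation constant alpha.

    cost_fn is a hypothetical stand-in for FOO; (pr_low, pr_high) is a
    hypothetical encoding of the permissible range PR.
    """
    delta_0 = math.sqrt(12.0 * energy)       # threshold scale factor (1303)
    alpha_low, alpha_high = 0.0, 1.0
    alpha = alpha_high
    for _ in range(max_iter):
        used_alpha = alpha
        # Utilized scale factor (1305): geometric interpolation.
        delta = delta_0 ** alpha * delta_abs ** (1.0 - alpha)
        # Nearest-integer quantization (1309); ties round upward here.
        q = math.floor(c_f / delta + 0.5)
        cost = cost_fn(q, delta)
        if pr_low <= cost <= pr_high:
            break                            # cost is within PR (1313)
        if cost > pr_high:
            alpha_high = alpha               # update rule of 1315
        else:
            alpha_low = alpha
        alpha = 0.5 * (alpha_low + alpha_high)
    return q, delta, used_alpha

# Toy cost (hypothetical): number of bits in the magnitude of q.
q, delta, alpha = rate_loop(100.0, 0.75, 48.0,
                            lambda q, d: abs(q).bit_length(),
                            pr_low=3, pr_high=4)
```

In a full coder the cost would cover a whole segment rather than one coefficient, and the quantized scale factor would be used in place of the unquantized one.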
STEREOPHONIC DECODER

The stereophonic decoder has a very simple structure. Its main functions are reading the incoming bitstream,
decoding all the data, inverse quantization and reconstruction of RIGHT and LEFT channels. The technique is represented
in FIG. 12.
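
The final reconstruction step can be sketched as follows for bands coded in S/D mode. The normalisation is an assumption of this sketch: SUM is taken as (LEFT + RIGHT)/2 and DIFFERENCE as (LEFT - RIGHT)/2, so that LEFT = S + D and RIGHT = S - D. The channel order in R/L bands, the flag polarity and the band-edge layout are likewise hypothetical.

```python
from typing import List, Sequence, Tuple

def reconstruct_left_right(ch_a: Sequence[float], ch_b: Sequence[float],
                           band_edges: Sequence[int],
                           sd_flags: Sequence[int]
                           ) -> Tuple[List[float], List[float]]:
    """Undo per-band S/D coding.

    Bands flagged 0 are assumed to hold (LEFT, RIGHT) already; bands flagged 1
    are assumed to hold (SUM, DIFFERENCE) with the 1/2 normalisation above.
    """
    left, right = list(ch_a), list(ch_b)
    bands = zip(band_edges[:-1], band_edges[1:])
    for (start, stop), is_sd in zip(bands, sd_flags):
        if is_sd:
            for i in range(start, stop):
                s, d = ch_a[i], ch_b[i]
                left[i], right[i] = s + d, s - d
    return left, right

# Two bands: coefficients 0..1 coded as S/D, coefficient 2 coded as R/L.
left, right = reconstruct_left_right([3.0, 1.0, 5.0], [1.0, 2.0, 7.0],
                                     band_edges=[0, 2, 3], sd_flags=[1, 0])
```

The coding mode flags read from the side information drive this step band by band, mirroring the encoder's decision.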

Illustrative embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16 or
DSP32C, and software performing the operations discussed below. Very large scale integration (VLSI) hardware
embodiments, as well as hybrid DSP/VLSI embodiments, may also be provided.



Claims 


1. A method of processing an audio signal comprising the steps of:

digitizing (102) said audio signal to form blocks of samples thereof,

processing (308) the samples of each of said blocks to form a Fourier transform representation thereof,

generating in response to said Fourier transform representation of each of said blocks a set of perceptual
thresholds (204), and

quantizing the representations using a quantization step (206),

CHARACTERIZED IN THAT

said generating step comprises generating in response to said Fourier transform representation of each of
said blocks a signal representing a Modified Discrete Cosine transform representation (310) of said Fourier
transform representation, there being a one-to-one correspondence between individual elements of said perceptual
thresholds and individual elements of said Modified Discrete Cosine transform representation, said
Modified Discrete Cosine transform representation being generated in serial response to the formation of said
Fourier transform representation, and

said quantizing step (206) comprises quantizing said individual elements of said Modified Discrete Cosine
transform representation using a quantization step size that is a function of the corresponding perceptual
thresholds.

2. A method as claimed in claim 1, wherein said processing step forms a Discrete Fourier Transform representation
as said Fourier transform.

3. A method as claimed in claim 1, wherein said processing step forms a Fast Fourier Transform representation as
said Fourier transform.

4. A method as claimed in claim 1, wherein said processing step forms an odd-frequency Fourier Transform
representation as said Fourier transform.

5. A method as claimed in claim 1, wherein said processing step forms an odd-frequency Fast Fourier Transform
representation as said Fourier transform.

Patent claims (translated from the German text)

1. A method for processing an audio signal, having the following steps:

digitizing (102) said audio signal to form blocks of samples of said audio signal,

processing (308) the samples of each of said blocks to form a Fourier transform representation thereof,

generating, in response to said Fourier transform representation of each of said blocks, a set of perceptual
thresholds (204), and

quantizing the representations using a quantization step (206),

characterized in that

said generating step comprises generating, in response to said Fourier transform representation of each of
said blocks, a signal representing a modified discrete cosine transform representation (310) of said Fourier
transform representation, there being a one-to-one correspondence between individual elements of said
perceptual thresholds and individual elements of said modified discrete cosine transform representation, said
modified discrete cosine transform representation being generated in serial response to the formation of said
Fourier transform representation, and

said quantizing step (206) comprises quantizing said individual elements of said modified discrete cosine
transform representation using a quantization step size that is a function of the corresponding perceptual
thresholds.

2. A method according to claim 1, wherein said processing step forms a discrete Fourier transform representation
as said Fourier transform.

3. A method according to claim 1, wherein said processing step forms a fast Fourier transform representation
as said Fourier transform.

4. A method according to claim 1, wherein said processing step forms an odd-frequency Fourier transform
representation as said Fourier transform.

5. A method according to claim 1, wherein said processing step forms an odd-frequency fast Fourier transform
representation as said Fourier transform.



Claims (translated from the French text)

1. A method of processing an audio signal comprising the steps of:

digitizing (102) said audio signal to form blocks of samples thereof,

processing (308) the samples of each of said blocks to form a Fourier transform representation thereof,

generating, in response to said Fourier transform representation of each of said blocks, a set of perceptual
thresholds (204), and

quantizing the representations using a quantization step (206),

CHARACTERIZED IN THAT

said generating step comprises generating, in response to said Fourier transform representation of each of
said blocks, a signal representing a modified discrete cosine transform representation (310) of said Fourier
transform representation, there being a one-to-one correspondence between individual elements of said
perceptual thresholds and individual elements of said modified discrete cosine transform representation, said
modified discrete cosine transform representation being generated in serial response to the formation of said
Fourier transform representation, and

said quantizing step (206) comprises quantizing said individual elements of said modified discrete cosine
transform representation using a quantization step size that is a function of the corresponding perceptual
thresholds.

2. A method according to claim 1, wherein said processing step forms a discrete Fourier transform representation
as said Fourier transform.

3. A method according to claim 1, wherein said processing step forms a fast Fourier transform representation
as said Fourier transform.

4. A method according to claim 1, wherein said processing step forms an odd-frequency Fourier transform
representation as said Fourier transform.

5. A method according to claim 1, wherein said processing step forms an odd-frequency fast Fourier transform
representation as said Fourier transform.